Structured Prediction

Core Concept

Structured prediction is a supervised learning task where the goal is to predict outputs that have rich internal structure—sequences, trees, graphs, or other composite objects—rather than a single label or scalar. Given an input (e.g. a sentence, an image, a graph), the model produces a structured object that satisfies constraints and dependencies among its parts (e.g. a tag per word, a parse tree, a segmentation mask). Unlike classification (one label per instance) or regression (one number), structured prediction deals with exponentially or infinitely large output spaces where the structure is both a constraint and a source of inductive bias. Examples include part-of-speech tagging (sequence of tags), named entity recognition (spans and labels), syntactic or dependency parsing (trees or labeled arcs), and image segmentation (label per pixel or region).

The Learning Process

The structured prediction process involves training on examples where both inputs and target structures are known. The model typically defines a score or energy over candidate structures and is trained to assign higher scores to correct structures (e.g. via a structured hinge/margin loss or conditional likelihood). Because the output space is huge, training and inference require efficient algorithms that exploit the structure: dynamic programming for sequences and trees, message passing for graphs, or approximate search. During inference, the model searches for the highest-scoring valid structure (e.g. Viterbi for sequences, CKY or chart parsing for trees); exact decoding is tractable when the structure admits a compact decomposition (e.g. a chain or a context-free tree), and approximate decoding (beam search, sampling) is used otherwise.
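
As a minimal illustration of score-based training, the sketch below implements a structured perceptron for sequence labeling: the score of a tag sequence is a sum of emission and transition weights, and each update moves the weights toward the gold structure and away from the current best prediction. The feature templates and toy data are illustrative assumptions, and brute-force enumeration stands in for a real Viterbi decoder, so it only works for tiny inputs.

    import itertools
    from collections import defaultdict

    def score(weights, words, tags):
        """Score a candidate tag sequence as a sum of local (emission + transition) weights."""
        total, prev = 0.0, "<START>"
        for word, tag in zip(words, tags):
            total += weights[("emit", word, tag)] + weights[("trans", prev, tag)]
            prev = tag
        return total

    def predict(weights, words, tagset):
        """Brute-force argmax over all tag sequences (stand-in for Viterbi; tiny inputs only)."""
        return max(itertools.product(tagset, repeat=len(words)),
                   key=lambda tags: score(weights, words, tags))

    def perceptron_update(weights, words, gold_tags, tagset, lr=1.0):
        """One structured perceptron step: boost gold-structure features, penalize the prediction's."""
        pred_tags = predict(weights, words, tagset)
        if tuple(pred_tags) == tuple(gold_tags):
            return
        for tags, sign in ((gold_tags, +lr), (pred_tags, -lr)):
            prev = "<START>"
            for word, tag in zip(words, tags):
                weights[("emit", word, tag)] += sign
                weights[("trans", prev, tag)] += sign
                prev = tag

    # Toy usage: a single POS-tagging example, a few passes over it.
    weights = defaultdict(float)
    words, gold = ["the", "dog", "barks"], ["DET", "NOUN", "VERB"]
    for _ in range(5):
        perceptron_update(weights, words, gold, tagset=["DET", "NOUN", "VERB"])
    print(predict(weights, words, ["DET", "NOUN", "VERB"]))   # ('DET', 'NOUN', 'VERB')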

Structure Types

Common forms of structured output and how they are represented

  • Sequences: Ordered list of labels (e.g. one label per token); dependencies are often local (e.g. Markov) or modeled with recurrent or attention-based networks. Used in sequence labeling (POS, NER, chunking), alignment, and many sequence-to-sequence tasks (common in-code encodings are sketched after this list).
  • Trees: Hierarchical structure (constituency trees, dependency trees); a grammar or transition system defines valid trees. Parsing produces trees; decoding uses dynamic programming (CKY, Eisner) or transition-based parsers.
  • Graphs and segmentations: Output is a graph (e.g. relation extraction, scene graphs) or a partition of the input (e.g. image segmentation, coreference clusters). Often addressed with graph neural networks, CRFs on graphs, or iterative refinement.
  • Joint structure: Multiple interacting substructures (e.g. joint entity and relation extraction, joint parsing and NER); requires joint inference or a pipeline with consistency constraints.
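
To make the representations above concrete, here is a small sketch of how each structure type is commonly encoded in code; the particular encodings (BIO tags, head-index arrays, span tuples) are conventional choices rather than anything prescribed by a specific library.

    # Illustrative in-code encodings for the structure types listed above.
    words = ["Barack", "Obama", "visited", "Paris"]

    # Sequence: one label per token, here BIO-encoded named-entity tags.
    bio_tags = ["B-PER", "I-PER", "O", "B-LOC"]

    # Dependency tree: a head index per token (-1 marks the root) plus an arc label.
    heads  = [1, 2, -1, 2]                    # Barack->Obama, Obama->visited, root, Paris->visited
    labels = ["flat", "nsubj", "root", "obj"]

    # Segmentation / clustering: a partition of positions (e.g. coreference clusters).
    clusters = [{0, 1}, {3}]                  # {Barack, Obama} and {Paris}

    # Graph output: typed edges between spans, e.g. relations for relation extraction.
    relations = {((0, 2), "visited", (3, 4))}  # (subject span, relation label, object span), end-exclusive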

Inference and Decoding

How the model produces a single best (or approximate) structure

  • Exact decoding: When the scoring function decomposes over the structure (e.g. a sum of local scores), dynamic programming yields the global optimum, e.g. Viterbi for linear-chain CRFs or CKY for context-free parsing (see the Viterbi sketch after this list). Tractable for chains and many grammars.
  • Beam search: Maintain a small set of partial hypotheses and extend them incrementally; approximate when exact search is too expensive. Common in neural sequence and tree models.
  • Transition-based decoding: Build the structure with a sequence of actions (e.g. shift, reduce, arc-add); a classifier or scorer predicts the next action. Flexible and often fast; may not guarantee a globally optimal structure.
  • Sampling and reranking: Sample candidate structures from a model (or generator) and rerank them with a more expensive scorer; useful when the best structure is hard to compute exactly.
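
For the exact-decoding case, here is a minimal Viterbi sketch for a chain-structured score that decomposes into per-position emission scores and pairwise transition scores; the toy score tables at the bottom are made up for illustration.

    def viterbi(emission, transition, tags):
        """Exact decoding for a chain: emission[t][tag] and transition[(prev, tag)] are local scores.

        Dynamic programming tracks, for each position and tag, the best score of any prefix
        ending in that tag, plus a backpointer used to recover the argmax sequence.
        """
        n = len(emission)
        best = [{tag: emission[0][tag] for tag in tags}]      # position 0 has no transition term
        back = [{}]
        for t in range(1, n):
            best.append({})
            back.append({})
            for tag in tags:
                prev = max(tags, key=lambda p: best[t - 1][p] + transition[(p, tag)])
                best[t][tag] = best[t - 1][prev] + transition[(prev, tag)] + emission[t][tag]
                back[t][tag] = prev
        last = max(tags, key=lambda tag: best[n - 1][tag])    # best final tag
        path = [last]
        for t in range(n - 1, 0, -1):                         # follow backpointers
            path.append(back[t][path[-1]])
        return list(reversed(path)), best[n - 1][last]

    # Toy scores for a 3-token sentence over two tags.
    tags = ["N", "V"]
    emission = [{"N": 2.0, "V": 0.5}, {"N": 0.5, "V": 2.0}, {"N": 1.0, "V": 1.0}]
    transition = {("N", "N"): 0.5, ("N", "V"): 1.0, ("V", "N"): 1.0, ("V", "V"): 0.0}
    print(viterbi(emission, transition, tags))                # (['N', 'V', 'N'], 7.0)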

Evaluation

Measuring quality of predicted structures

  • Exact match / accuracy: Fraction of instances where the full structure is correct; strict and often low for long sequences or complex trees.
  • Token-/span-level metrics: For sequences, precision, recall, and F1 over labels or spans (e.g. NER F1); for trees, labeled attachment score (LAS), unlabeled attachment score (UAS), or bracket F1 for constituency (a span-F1 sketch follows this list).
  • Ranking and calibration: When the model scores multiple candidates, metrics may consider whether the gold structure is ranked highly or assigned high probability.
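
As a small sketch of the span-level case, the function below computes precision, recall, and F1 over labeled spans, counting a predicted span as correct only when both boundaries and the label match exactly (the usual NER convention); the function name and toy spans are illustrative.

    def span_f1(gold_spans, pred_spans):
        """Precision/recall/F1 over (start, end, label) spans; only exact matches count."""
        gold, pred = set(gold_spans), set(pred_spans)
        tp = len(gold & pred)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # Toy example: one boundary error and one spurious prediction.
    gold = [(0, 2, "PER"), (5, 6, "LOC")]
    pred = [(0, 2, "PER"), (4, 6, "LOC"), (8, 9, "ORG")]
    print(span_f1(gold, pred))   # (0.3333..., 0.5, 0.4)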

Common Challenges

  • Large output space: The set of valid structures is huge or infinite, so exhaustive enumeration is impossible; models must rely on factorization and efficient search.
  • Non-decomposable losses: Task metrics (e.g. F1, NDCG) often do not decompose over the structure, so training uses tractable surrogate losses (likelihood, hinge).
  • Data and annotation: Labeled structures are expensive to obtain (e.g. treebanks, segmentation masks); semi-supervised learning and weak supervision are common.
  • Exposure bias: When training conditions on the gold structure but inference conditions on model-generated structure, errors compound; mitigations include scheduled sampling and reinforcement learning.
  • Consistency and constraints: Outputs must satisfy hard constraints (e.g. valid trees, non-overlapping spans); decoding must respect them (see the sketch after this list), and training can incorporate constraint-based or Lagrangian methods.
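
As a concrete example of a hard output constraint, BIO-encoded spans require that an I-X tag follow a B-X or I-X tag of the same type. The small check below (an illustrative sketch, not a library API) expresses that constraint; a constrained decoder can use it to mask invalid transitions, e.g. by setting their transition scores to negative infinity before running Viterbi.

    def bio_transition_ok(prev_tag, tag):
        """True if `tag` may follow `prev_tag` under the BIO scheme.

        The only illegal move is placing an I-X without a preceding B-X or I-X
        of the same entity type X.
        """
        if not tag.startswith("I-"):
            return True                    # O and B-* are always allowed
        entity_type = tag[2:]
        return prev_tag in (f"B-{entity_type}", f"I-{entity_type}")

    def is_valid_bio(tags):
        """Check a whole tag sequence, treating the sequence start as 'O'."""
        prev = "O"
        for tag in tags:
            if not bio_transition_ok(prev, tag):
                return False
            prev = tag
        return True

    print(is_valid_bio(["B-PER", "I-PER", "O", "B-LOC"]))   # True
    print(is_valid_bio(["O", "I-PER", "O"]))                # False: I-PER with no preceding B-PER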


Sub-types

Structured prediction sub-types are distinguished by the form of the output structure (sequence of labels vs. tree vs. graph) and by the kinds of dependencies modeled and the decoding algorithms used.

  • Sequence Labeling: Assigning a label to each element of a sequence (e.g. a POS tag per word, a BIO label per token for NER). Output is a sequence of labels with local or long-range dependencies.
  • Parsing: Predicting a syntactic tree (constituency or dependency) over a sentence. Output has hierarchical structure and must satisfy grammatical constraints (see the transition-based sketch below).
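
To make the transition-based view of parsing concrete, the sketch below replays a fixed arc-standard action sequence (SHIFT, LEFT-ARC, RIGHT-ARC) over a short sentence and recovers the resulting dependency heads; in a real parser a learned classifier would choose each action, and the example sentence and actions here are made up for illustration.

    def apply_arc_standard(words, actions):
        """Replay arc-standard transitions and return each word's head index (-1 = root).

        SHIFT moves the next buffer word onto the stack; LEFT-ARC attaches the
        second-from-top stack item to the top one and pops it; RIGHT-ARC attaches
        the top item to the one below it and pops the top.
        """
        heads = [-1] * len(words)
        stack, buffer = [], list(range(len(words)))
        for action in actions:
            if action == "SHIFT":
                stack.append(buffer.pop(0))
            elif action == "LEFT-ARC":
                dep = stack.pop(-2)        # second-from-top becomes a dependent of the top
                heads[dep] = stack[-1]
            elif action == "RIGHT-ARC":
                dep = stack.pop()          # top becomes a dependent of the new top
                heads[dep] = stack[-1]
        return heads

    words = ["She", "ate", "fish"]
    actions = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC"]
    print(apply_arc_standard(words, actions))   # [1, -1, 1]: "ate" is the root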